ABSTRACT

We propose a local modelling approach using deep convolutional neural networks (CNNs) for fine-grained image classification. Recently, deep CNNs trained on large datasets have considerably improved the performance of object recognition. However, to date there has been limited work using these deep CNNs as local feature extractors. This partly stems from CNNs having high-dimensional internal representations, which are difficult to model using stochastic models. To overcome this issue, we propose to reduce the dimensionality of one of the internal fully connected layers, in conjunction with layer-restricted retraining to avoid retraining the entire network. The distribution of low-dimensional features obtained from the modified layer is then modelled using a Gaussian mixture model. Comparative experiments show that considerable performance improvements can be achieved on the challenging Fish and UEC FOOD-100 datasets.

Index Terms: fine-grained classification, deep convolutional neural networks, session variation modelling, Gaussian mixture models.


1. INTRODUCTION 

Fine-grained image classification refers to the task of recognising the class or subcategory (for instance, the particular fish species) within the same basic category, such as birds or fish [?]. This is a challenging task for two reasons. First, some classes (species) from the same category, such as fish, can be very similar in appearance, leading to low inter-class variation. Second, there is a high degree of variability among instances of the same class due to environmental and illumination variations, leading to high intra-class variation. Fig. 1 shows examples of both issues.

One approach to tackling these two issues is to extract local region descriptors and to model them. Such an approach has previously been popular for recognition of faces [?] and fish [1]. These approaches typically divide the image into patches (or blocks), with each patch considered to be an independent (and partial) observation of the object. Each patch is then represented by a feature vector, and the distribution of all of the feature vectors from an image is then modelled using a Gaussian mixture model (GMM). The feature vector representing each patch has usually been obtained from a transform such as the 2D discrete cosine transform [?].

Fig. 1: The first two rows show example images of four fish species, which have low inter-class variation: similar visual appearance despite being distinct species (images taken by J.E. Randall). The last two rows show images of four food dishes, with each dish type having high intra-class variation.


Recently, feature learning through the use of deep convolutional neural networks (CNNs) has led to considerable improvements for object recognition [10]. These deep CNN feature representations are trained on large datasets such as ImageNet [13], which has 1,000 general object categories. It has been shown that these learnt features can be used to obtain impressive results for other recognition tasks when used as a global image representation [14]. However, to the best of our knowledge, no work has examined how to use these learnt features as a local feature extractor for use with well-known statistical modelling approaches such as GMMs.

To use these deep CNN features as a local feature extractor, two issues need to be addressed. First, deep CNNs such as [10] generally have an internal representation which is high dimensional, leading to the curse of dimensionality [?] for local modelling techniques such as GMMs. Second, we need to develop an efficient and effective method to retrain a deep CNN containing millions of weights using a relatively small set of images specific to a fine-grained class. In this paper we address both of these issues.

Inspired by recent work that has shown how to optimise deep CNN features for small datasets using fine-tuning [?], we propose a method to obtain a low-dimensional deep CNN representation that can be used as a local feature descriptor. Specifically, we propose to explicitly reduce the dimensionality of one of the internal fully connected layers, in conjunction with layer-restricted retraining to avoid retraining the entire network. We demonstrate empirically that the proposed approach leads to considerable performance improvements for two fine-grained image classification tasks: fish recognition [1] and food recognition [12].

We continue the paper as follows. In Section 2 we briefly describe the image classification approach based on statistical modelling of local features and inter-session variability modelling. This approach is the base upon which we build in Section 3, where we learn a low-dimensional deep CNN representation that can be used as a local feature descriptor. Comparative experiments are given in Section 4, followed by the main findings and future directions in Section 5.

2. MODELLING LOCAL IMAGE FEATURES 

Modelling the distribution of local features has been explored by several researchers [?]. In general, these methods divide the j-th image of the i-th class, $I_{i,j}$, into N overlapping patches. Each patch is represented by an M-dimensional feature vector, of low dimensionality, to yield the set of N feature vectors $O_{i,j} = [o_{i,j,1}, \ldots, o_{i,j,N}]$. The distribution of the vectors is then modelled using a GMM to obtain a prior model, referred to as a universal background model (UBM), that represents the basic category in question (e.g. fish, food).
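
The patch division described above can be illustrated with a minimal sketch. The function name `extract_patches`, the block size, and the step between neighbouring blocks are assumptions made for illustration only; the text does not state how much the patches overlap.

```python
import numpy as np

def extract_patches(image, block_size, step):
    """Return overlapping square blocks and their top-left (x, y) corners."""
    height, width = image.shape[:2]
    patches, corners = [], []
    # A step smaller than block_size makes neighbouring blocks overlap.
    for y in range(0, height - block_size + 1, step):
        for x in range(0, width - block_size + 1, step):
            patches.append(image[y:y + block_size, x:x + block_size])
            corners.append((x, y))
    return np.stack(patches), np.array(corners)

# Hypothetical 256 x 256 RGB image, block size 128 and step 32.
image = np.zeros((256, 256, 3), dtype=np.uint8)
patches, corners = extract_patches(image, block_size=128, step=32)
```

Each returned patch would then be passed through a feature extractor to give one of the N feature vectors of the image.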

This UBM representation forms the basis of many feature modelling methods. It can be used as a probabilistic bag-of-words representation [?], or a model can be derived for each class by performing mean-only relevance MAP adaptation [?]. Another extension is to perform inter-session variability (ISV) modelling [?], which learns those variations that can make one instance (image) of a class look different to another image of the same class.

Irrespective of the specific method, they all rely on a GMM, which is known to perform poorly for high-dimensional data [?]. This is partly due to the curse of dimensionality, where it becomes difficult to estimate a large number of parameters when there is limited data. To avoid this we will show how to learn a low-dimensional deep CNN representation. Before proceeding to this, we first describe the GMM feature modelling methods that we use in this work.

2.1. GMM Feature Modelling 

We use two feature modelling approaches in this work: GMM mean-only MAP adaptation and its extension, ISV. These two are chosen as they have been shown to provide consistently good performance [?].


GMM mean-only MAP adaptation takes the prior model (UBM) and adapts just the means using the enrollment data of the i-th class, $O_i$: all of the features for the $J_i$ enrollment images. Using supervector notation [?], this is written as

$$ s_i = m + D z_i, \tag{1} $$

where $s_i$ is the mean supervector for the i-th class, $m$ is the mean supervector of the UBM (the prior), $z_i$ is a normally distributed latent variable, and $D$ is a diagonal matrix that incorporates the relevance factor and the covariance matrix and ensures the result is equivalent to mean-only relevance MAP adaptation.
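
As an illustration, mean-only relevance MAP adaptation can be sketched in its classical per-component form, which Eq. (1) expresses in supervector notation. This is a minimal sketch assuming diagonal covariances; the function names, the toy two-component UBM, and the relevance factor of 4 are assumptions rather than values taken from this work.

```python
import numpy as np

def responsibilities(X, weights, means, variances):
    """Posterior probability of each diagonal-covariance component for each row of X."""
    diff = X[:, None, :] - means[None, :, :]                      # (N, C, M)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                        # (N, C)
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def map_adapt_means(X, weights, means, variances, relevance=4.0):
    """Adapt only the UBM means towards the enrollment features X."""
    gamma = responsibilities(X, weights, means, variances)
    n = gamma.sum(axis=0)                                         # zeroth-order statistics
    first = gamma.T @ X                                           # first-order statistics
    data_means = first / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]                        # data-dependent mixing weight
    return alpha * data_means + (1.0 - alpha) * means

# Toy UBM with two components in two dimensions (hypothetical values).
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
ubm_variances = np.ones((2, 2))
rng = np.random.default_rng(0)
X = rng.normal(1.0, 0.5, size=(50, 2))                            # enrollment features near component 0
adapted = map_adapt_means(X, ubm_weights, ubm_means, ubm_variances)
```

In this sketch `adapted.ravel()` plays the role of the class supervector $s_i$ and `ubm_means.ravel()` the role of $m$. A component that receives almost no enrollment data keeps its UBM mean, which is the behaviour the relevance factor is meant to provide.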

ISV is an extension of the GMM mean-only MAP model which learns a subspace that models and suppresses session variation [?]. It includes a subspace $U$ to cope with session variation and is written in supervector notation as

$$ \mu_{i,j} = m + U x_{i,j} + D z_i, \tag{2} $$

where $x_{i,j}$ is the latent session variable and is assumed to be normally distributed. Suppressing the session variation is done by jointly estimating the latent variables $z_i$ and $[x_{i,1}, \ldots, x_{i,J_i}]$, followed by discarding the latent session variables to give

$$ s_{\mathrm{ISV},i} = m + D z_i. \tag{3} $$

For both of these methods, the log-likelihood ratio is used to determine whether the t-th test image $I_t$ was most likely produced by class i. This is efficiently calculated using the linear scoring approximation [?], which for GMM mean-only MAP is

$$ h_{\mathrm{linear}}(O_t, s_i) = (s_i - m)^{\top} \Sigma^{-1} f_{t|m}, \tag{4} $$

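and for ISV it is

$$ h_{\mathrm{ISV}}(O_t, s_i) = (s_{\mathrm{ISV},i} - m)^{\top} \Sigma^{-1} \left( f_{t|m} - N_t U x_{\mathrm{UBM},t} \right), \tag{5} $$

where the diagonal matrix $\Sigma$ is formed by concatenating the diagonals of the UBM covariance matrices, $f_{t|m}$ is the supervector of mean-normalised first-order statistics, $N_t$ contains the zeroth-order statistics for the test sample in a block-diagonal matrix, and $x_{\mathrm{UBM},t}$ is the latent session variable estimated for the test sample [?].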
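
The linear score for the mean-only MAP case, Eq. (4), can be sketched as follows. This is an illustrative sketch rather than the implementation used in this work: the function name `linear_score` is hypothetical, the component responsibilities under the UBM are assumed to be precomputed, and "mean-normalised" is taken here to mean first-order statistics centred on the UBM means.

```python
import numpy as np

def linear_score(gamma, X, ubm_means, ubm_variances, class_means):
    """Linear scoring for a mean-only MAP class model.

    gamma: (N, C) responsibilities of the N test features under the UBM.
    X: (N, M) test features; all other arguments have shape (C, M).
    """
    n = gamma.sum(axis=0)                        # zeroth-order statistics
    first = gamma.T @ X                          # first-order statistics
    f_centred = first - n[:, None] * ubm_means   # statistics centred on the UBM means
    offset = class_means - ubm_means             # (s_i - m), one row per component
    # Sigma is diagonal, so the quadratic form reduces to an element-wise sum.
    return float(np.sum(offset * f_centred / ubm_variances))

# Tiny hand-checkable example with hard assignments (hypothetical values).
gamma = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[1.0, 2.0], [3.0, 4.0]])
ubm_means = np.zeros((2, 2))
ubm_variances = np.ones((2, 2))
class_means = np.array([[1.0, 0.0], [0.0, 1.0]])
score = linear_score(gamma, X, ubm_means, ubm_variances, class_means)
```

The ISV score in Eq. (5) would additionally subtract the estimated session offset from the centred statistics before taking the product; that step is omitted here.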

3. PROPOSED METHOD 

To extract features from local patches, we aim to learn a low-dimensional deep CNN representation which we refer to as a low-dimensional CNN feature vector (LDCNN). This is in contrast to the high-dimensional representation (4096 dimensions) that is usually obtained from the fully connected layer (fc-6) of the pretrained deep CNN [10]; the structure of this network can be seen in Fig. 2. Such high-dimensional representations are difficult to model effectively with a stochastic model such as a GMM. We therefore aim to learn a low-dimensional representation (LDCNN) whose dimensionality M is much less than 4096. To reduce the dimensionality while preventing the parameters of the large CNN architecture from overfitting, we propose a two-step modification of the network.

Fig. 2: Modifying and retraining the deep CNN through a two-step procedure. For each step we have shaded in green the parts of the network that are changed and retrained. First step: the highlighted fc-8 layer is modified to have only as many outputs as the number of dataset-specific classes. The layer is retrained, while all the other parameters remain fixed. Second step: the highlighted fc-6 layer is changed to map to only M outputs, followed by training the fc-6 layer in conjunction with the highlighted fc-7 layer, while keeping the remaining parameters fixed. The output of the fc-6 layer is used as a local feature extractor.


In the first step, using the pretrained network of [10] as a starting point, we modify the final output layer (fc-8) to have outputs for the $N_c$ training classes. The weights are randomly initialised¹ and retraining is then conducted such that only the fc-8 layer is updated, using a learning rate of 0.01. This process equates to a multiclass linear regression, using the pretrained network as a feature extractor. It converges after a few thousand iterations.

In the second step we replace the two fully connected layers fc-6 and fc-7 and retrain only these two layers with the other layers fixed. We replace the original 4096-dimensional fc-6 layer with a new M-dimensional fc-6 layer that is randomly initialised¹, where M < 4096. Features extracted from this layer are referred to as LDCNN. The fc-7 layer is also replaced and randomly initialised¹, as fc-6 and fc-7 are densely connected. However, when we retrain the network, fc-7 retains its original dimensionality of 4096. Retraining is then performed using back propagation and stochastic gradient descent to update only these two layers. The learning rate is initially set to 0.01 and is reduced by a factor of 10 every 1000 iterations throughout the training process. In this way, all pretrained convolutional layer filters from the original network [10] are retained.
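
The learning rate schedule described for the second step can be written as a simple step decay. The following is a minimal sketch; the function and parameter names are hypothetical and are not taken from any particular toolkit.

```python
def learning_rate(iteration, base_lr=0.01, decay=0.1, step_size=1000):
    """Step decay: start at base_lr and multiply by decay every step_size iterations."""
    return base_lr * decay ** (iteration // step_size)

# Rates at the start of training and after 1000, 2000 and 3000 iterations.
schedule = [learning_rate(i) for i in (0, 1000, 2000, 3000)]
```

Under this schedule the rate stays at 0.01 for the first 1000 iterations, then drops to roughly 0.001, 0.0001, and so on.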

4. EXPERIMENTS 

We evaluate our approach on two fine-grained image datasets: Fish [1] and UEC FOOD-100 [12]. For both datasets we present two baseline systems, both of which perform classification using an SVM and extract a single global CNN feature to represent each image. The first baseline extracts a single global feature vector using fc-6 of the pre-trained deep CNN [10] (4096 dimensions); we refer to this as SVM-CNN. The second baseline extracts a single global feature vector using the re-trained low-dimensional CNN feature (LDCNN) vector; we refer to this as SVM-LDCNN.

¹ Random initialisation is performed by drawing from $\mathcal{N}(0, 0.01^2)$.

The local feature modelling results (GMM), where the image is divided into N overlapping patches, use two feature extractors. Each feature extractor obtains an M-dimensional feature vector from each of the N patches, which is then modelled using a GMM. The first, GMM-LDCNN, uses the proposed low-dimensional CNN feature vector (LDCNN) to obtain the M-dimensional feature vector. The second, GMM-PCA-CNN, uses fc-6 of the pre-trained deep CNN [10] (4096 dimensions) and learns a transform using principal component analysis (PCA) [?] to reduce the dimensionality to M.
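
For illustration, the PCA reduction used by the GMM-PCA-CNN comparison system can be sketched with a singular value decomposition. This is a generic sketch and not the exact procedure used in the experiments; the helper names are hypothetical, and a small random matrix stands in for the 4096-dimensional fc-6 features to keep the example light.

```python
import numpy as np

def fit_pca(features, n_components):
    """Learn the mean and the leading principal directions of the training features."""
    mean = features.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features onto the learnt principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(200, 64))   # stand-in for high-dimensional CNN features
mean, components = fit_pca(cnn_features, n_components=8)
reduced = apply_pca(cnn_features, mean, components)
```

Unlike the retrained fc-6 layer, this projection is a fixed linear map that is learnt without using the class labels.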

When we perform local feature modelling (GMM), a range of parameters is varied. The number of GMM components evaluated was C = [128, 256, 512, 1024], the size of the ISV subspace was $N_u$ = [2, 4, 8, ..., 256], and the range of block sizes was B = [32, 64, 96, 128]. For both datasets the images were resized to 256 x 256. Caffe [?] was used to extract and retrain the CNN features and Bob [?] was used to learn the GMM and ISV models.

4.1. Fine-Grained Fish Classification 

We use the Fish image dataset from [1], which consists of 3,960 images collected from 468 species. This dataset contains images captured in different conditions, defined as “controlled”, “out-of-the-water” and “in-situ”. The “controlled” images consist of fish specimens with controlled background and illumination. The “in-situ” images are underwater images of fish in their natural habitat, and the “out-of-the-water” images consist of fish specimens taken out of the water with a varying background.

Following the defined protocols, the dataset is split into three sets: a training set (train) to learn the UBM GMM models; a development set (dev) to determine the optimal parameters and decision threshold for our models; and an evaluation set (eval) to measure the final system performance. There are two protocols: protocol 1a evaluates the system performance when high-quality (“controlled”) data is used to enrol classes, and protocol 1b evaluates the system performance when low-quality (“in-situ”) data is used to enrol classes. For both protocols, the same test imagery (a mix of “controlled”, “in-situ” and “out-of-the-water” images) is used. The local modelling approach used for these experiments was the ISV extension of the GMM approach, as this provided a considerable boost in the initial experiments; we refer to this as GMM-LDCNN.

Table 1: Results on the Fish image dataset [1]. The two baseline approaches, SVM-CNN and SVM-LDCNN, are presented along with the state-of-the-art local modelling approach from [1] (Local GMM). GMM-PCA-CNN uses PCA-reduced features from fc-6 of the pre-trained CNN [10]. The proposed GMM-LDCNN method uses LDCNN features in conjunction with GMMs. GMM-LDCNN-xy extends LDCNN features by adding the spatial location of each block.

System            Protocol 1a       Protocol 1b
                  Dev     Eval      Dev     Eval
SVM-CNN           40.9    45.8      41.9    45.7
SVM-LDCNN         39.2    44.2      40.3    43.5
Local GMM [1]     43.1    49.3      40.8    46.7
GMM-PCA-CNN       45.7    51.5      44.0    47.2
GMM-LDCNN         51.8    55.5      46.4    49.5
GMM-LDCNN-xy      53.8    57.0      46.2    53.3

It has been shown in [?] that incorporating spatial information can be advantageous, and as such we further propose to extend the GMM-LDCNN approach by adding the spatial location (x, y) to each local feature vector prior to modelling; we refer to this method as GMM-LDCNN-xy.
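
The spatial extension amounts to concatenating each block's location to its feature vector. The following is a minimal sketch; the function name is hypothetical, and the text does not specify whether the coordinates are raw pixel positions, block centres, or normalised values, so raw top-left pixel coordinates are assumed here.

```python
import numpy as np

def append_location(features, corners):
    """Append the (x, y) location of each block to its M-dimensional feature vector."""
    return np.hstack([features, corners.astype(features.dtype)])

# Hypothetical example: 4 blocks with 256-dimensional features.
features = np.zeros((4, 256), dtype=np.float32)
corners = np.array([[0, 0], [128, 0], [0, 128], [128, 128]])
features_xy = append_location(features, corners)
```

The resulting (M + 2)-dimensional vectors would then be modelled in exactly the same way as the original M-dimensional ones.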

The results in Table 1 show that, in contrast to global features, local modelling provides notable improvements: the two baseline systems (SVM-CNN and SVM-LDCNN), which use global features, perform worse than the previous state-of-the-art local ISV modelling approach (Local GMM). Furthermore, our local low-dimensional GMM-LDCNN approach² outperforms local modelling of PCA-CNN features (GMM-PCA-CNN), with an average relative performance improvement of 6.4%. The extended form of the proposed approach (GMM-LDCNN-xy) provides further improvements and obtains state-of-the-art results, with an average relative performance improvement of 14.9% over Local GMM [1]. This demonstrates the effectiveness of local modelling over global features, and highlights the potential to use feature learning techniques such as CNNs to learn effective local representations.
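
As a worked check of the 14.9% figure, the average relative improvement of GMM-LDCNN-xy over Local GMM can be computed from the Table 1 values. The text does not say which columns are averaged; the sketch below assumes the two Eval columns, which reproduces the reported number.

```python
def relative_improvement(new, baseline):
    """Relative improvement of new over baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

# Eval results from Table 1 for protocols 1a and 1b.
local_gmm_eval = {"1a": 49.3, "1b": 46.7}
ldcnn_xy_eval = {"1a": 57.0, "1b": 53.3}

gains = [relative_improvement(ldcnn_xy_eval[p], local_gmm_eval[p]) for p in ("1a", "1b")]
average_gain = sum(gains) / len(gains)
print(round(average_gain, 1))   # prints 14.9
```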

4.2. Results on Food Dataset 

We use the UEC FOOD-100 dataset, which contains 100 Japanese food categories with more than 100 images for each category. Some images contain multiple classes and a bounding box is provided for each class. Examples are shown in Fig. 1. Features are extracted from the bounding box only, so detection/localisation is not considered in this paper.

We use half of the images from each class for training and the other half for testing³. The training images are used for retraining the CNN and to learn the UBM model. The dimensionality for fc-6 is set to M = 256 based on initial experiments. Initial experiments also indicated that neither the ISV extension to local modelling nor the inclusion of spatial (x, y) information in each feature vector provided performance improvements. As such, they were not used on this dataset. We believe that ISV did not lead to increased performance as this is a closed-set problem⁴ with a high number of enrollment images, resulting in less effective learning of a representation for session variation independent of the class. The spatial information did not help as the images are not accurately registered; consequently, modelling the location of parts (such as the eggs in Fig. 1) is not useful.

Fig. 3: Rank-n classification accuracy on the UEC FOOD-100 dataset [12].

² Optimal parameters for protocol 1a were C = 1024, B = 128, and $N_u$ = 128, while for protocol 1b they were C = 512, B = 96, and $N_u$ = 128.

³ We developed these protocols as insufficient details were provided to reproduce the experiments in [9]; our protocol files will be publicly available.

The results, presented in Fig. 3, show that performing local modelling using the LDCNN features (GMM-LDCNN) provides the best performance⁵. The results in Fig. 3 are presented in terms of rank-n classification accuracy, where a test image is counted as correct at rank n if its true class is among the n best matches. In terms of rank-1 accuracy (identification accuracy), local modelling of the LDCNN features (GMM-LDCNN) has an accuracy of 58.3%, which provides a considerable relative performance improvement of 9.4% compared to the SVM-LDCNN approach (using LDCNN to extract a global feature), which has an accuracy of 52.9%. The GMM-LDCNN approach also outperforms the SVM-CNN approach, which is similar to the best single-feature system presented in [9] (referred to as DCNN in their work) and has a rank-1 accuracy of 55.7%.
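
Rank-n classification accuracy, as used in Fig. 3, can be computed from a matrix of per-class scores. The following is a minimal sketch with hypothetical names and toy scores; it assumes that a higher score indicates a better match.

```python
import numpy as np

def rank_n_accuracy(scores, labels, n):
    """Fraction of test samples whose true class is among the n highest-scoring classes."""
    top_n = np.argsort(-scores, axis=1)[:, :n]        # best n classes per sample
    hits = (top_n == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 test samples scored against 3 classes.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.6, 0.3]])
labels = np.array([0, 2, 1])
rank1 = rank_n_accuracy(scores, labels, n=1)
rank2 = rank_n_accuracy(scores, labels, n=2)
```

In this toy example the second sample is only recovered at rank 2, so accuracy rises from two thirds at rank 1 to 1.0 at rank 2.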

5. CONCLUSION 

In this paper we have explored the benefits of using deep convolutional neural networks (CNNs) to extract local features which are then modelled using a GMM. Our two-step retraining procedure provides an effective way to perform dimensionality reduction and provides considerably better performance than a simple linear model such as PCA. Comparative experiments show that considerable performance improvements can be achieved on the challenging Fish and UEC FOOD-100 datasets.

Future work will examine other ways to retrain the deep 
CNN. For instance, an issue not examined in this work is the 
possibility of extracting thousands of local patches from each 
image and using these samples to retrain the entire network. 

⁴ By closed set we mean that while the data differs between the training and testing sets, the classes in both sets are the same.

⁵ The optimal parameters were C = 512 and B = 32.